ABSTRACT

C.mmp is a multi-mini-processor with up to sixteen processors. This paper presents and
discusses measurements of the C.mmp system at several levels:

1.  Basic  hardware  performance  measurements 

2.  Runtime  performance  of  Hydra,  C.mmp’s  operating  system 

3.  Overall  performance  of  a particular  application:  a parallel  rootfinding  algorithm. 

The purpose of this paper is to get a detailed look at the performance of an implementation
of a parallel program on C.mmp. The rootfinding algorithm was chosen because it meets two
constraints: it is a parallel algorithm with significant interprocess communication, and it is of
relatively low complexity, enabling us to focus on implementation issues rather than subtleties
in the algorithm itself.

Variations in processor speeds and asynchronously executing operating system functions
are shown to have a detrimental effect on the rootfinder's performance. However, the most
important implementation decision affecting the performance of the rootfinding program is the
type of synchronization semaphore used. We define the threshold for practical application of
a semaphore to be when 50% of the execution time is attributed to semaphore related
overheads. Using the 50% criterion, we measured thresholds for inter-synchronization times
from two milliseconds for the most primitive locks, to 200 milliseconds for the most
sophisticated and flexible semaphore. During the course of these measurements, Hydra
underwent several revisions and the 200 millisecond threshold was reduced to 33
milliseconds. The principal concept responsible for this performance improvement is
discussed in the paper.
1. Introduction

Most papers that extol the virtues of multiprocessor computer systems cite the higher
throughput and cost/performance [e.g. Sauer 1977, Fuller 1976] over the more traditional
uniprocessor. However, both of these performance advantages can be realized only if the
software effectively exploits the parallelism in the machine. To date, the task of writing
effective parallel software is still an ad-hoc procedure of constructing code for a one of a
kind machine. Since multiprocessors are almost as different from one another as they are
from uniprocessors, it is difficult to apply insight gained from writing parallel software for one
multiprocessor to another totally different machine. Yet by documenting the performance of
various implementations of several algorithms on one machine we can shed some light on how
effective various strategies are at capturing parallelism.

The purpose of this paper then is to provide a first-hand look at the implementation of
parallel algorithms on a multiprocessor. The nature of this investigation is experimental
rather than theoretical in that the results we present are derived from the measurement of
real programs running on a real multiprocessor - C.mmp.

The basic structure of C.mmp, as shown in the PMS diagram of Figure 1.1, is that of the
canonical multiprocessor. A detailed description of C.mmp is provided in the original article on
C.mmp by Bell and Wulf [1972], but the following description should provide a sufficient
background for this investigation.

C.mmp is organized as a system of 16 central processors (Pc's) that share a centrally
located large primary memory that presently consists of 2.5 Megabytes. The 16 Pc's are
completely asynchronous computing elements: 5 are PDP-11/20's and the remaining 11 are
PDP-11/40's. They are connected to the shared primary memory via a 16 x 16 crosspoint
switch. The operation of the switch is similar to a 16 ported memory in that up to 16
memory transactions can be performed simultaneously. I/O devices, unlike memory, are
associated with an individual processor. Thus, for example, an I/O request to a device on
Pc[0], perhaps a disk, is performed by the requesting Pc sending an interprocessor interrupt
to Pc[0], causing initiation of the appropriate I/O interrupt service routine on Pc[0].


Hydra is C.mmp's general-purpose multiprogramming operating system [Wulf et al., 1974;
Wulf et al., 1975; Levin et al., 1975]. It is a collection of basic or kernel mechanisms such as
memory management, process dispatching, and message passing. Upon this core an arbitrary
number of systems created from these mechanisms can co-exist simultaneously. Hydra is
organized as a set of re-entrant procedures that can be executed by any of the processors.
In fact, several processors can simultaneously execute the same procedure. This concurrency
is accomplished by placing locks around the operating system's critical data structures. These
locks maintain mutual exclusion where necessary. Throughout this paper we will refer to
Hydra as the Kernel or the Operating System.

Figure 1.1 PMS Diagram of C.mmp (1977)

In the following sections we develop a parallel algorithm to be used as a case study and
derive its theoretical performance. We enumerate the contributions to performance
fluctuation and degradation from several sources and quantify the magnitude of each source
in terms of the program's performance. One dominant influence on performance is the process
synchronization mechanism. We compare several alternative synchronization mechanisms and
conclude with a graph showing the range of inter-synchronization times for which each
mechanism is preferable.
2. Description of the Rootfinding Algorithm

The purpose of this study is to present quantitative performance results for implementing
parallel algorithms on a multiprocessor. Rather than attempting to measure a broad spectrum
of problems we have chosen to study various implementations of a single problem in order to
observe and measure in depth the performance tradeoffs in the implementation process.

Two criteria that our case study problem had to meet were: the problem must be complex
enough to have interesting implementation tradeoffs, and of low enough complexity to permit the
focus of attention on implementation issues rather than algorithm issues. The candidate
problem we finally selected is the rootfinding task.

We have chosen to consider this problem not because it is particularly well-suited for parallel
solution, but rather because it is a relatively straightforward task that requires a significant
amount of inter-process communication. According to Stone [1973], algorithms like the
rootfinding algorithm that exhibit speed-up gains proportional to the logarithm of the number
of processes fall into a class of problems at best considered poor candidates for parallel
processing. However, the underlying control structure present in this procedure, that of the
synchronous parallel algorithm, is representative of many parallel decompositions of
otherwise serial algorithms. For this reason it is worthwhile to understand the nature of the
control structure and to study the influences on its performance. Investigations now in
progress are considering larger problems and alternative control structures better able to
exploit the available parallelism of C.mmp [Oleinick 1978].

Specifically we will consider the problem of finding the root of a monotonically increasing
function in a bounded region. If we assume no special information about the behavior of the
function, the best procedure for a uniprocessor under these circumstances is a binary search.
An obvious decomposition of the binary search into n parallel processes on a multiprocessor
is to evaluate the function simultaneously at n equidistant points within the bounded region.

The optimal placement of the n processes on the interval is known [Kung 1976], but to
minimize the complexity of the algorithm in order to focus on the synchronous control
structure we will use the less than ideal, but good, technique illustrated in Figure 2.1. The n
parallel processes perform function evaluations at the n points that divide the interval into
n+1 equal subintervals. Since our function, F(x), is a monotonic function, the sub-interval that
contains the root is the sub-interval with opposite signs for F(x) at its end points. The other
sub-intervals are discarded and the procedure repeats this basic iteration until one of the
function evaluations is within ε, i.e. an acceptably small interval close to zero, of the
zero-crossing.

Figure 2.1 Rootfinding Program Using Three Processors (first through fourth iterations,
processes P1, P2, P3)
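The iteration described above can be sketched as follows. This is a minimal serial illustration, not the C.mmp program: the name `find_root` is hypothetical, the n evaluations that would run in separate processes are done in a simple loop, and termination is assumed to occur when the bracketing interval is no wider than `eps` (the text instead stops when a function value is within ε of zero).

```python
def find_root(f, lo, hi, n, eps):
    """Locate the root of a monotonically increasing f on [lo, hi].

    Each stage evaluates f at the n interior points that split the
    interval into n + 1 equal sub-intervals, then keeps the one
    sub-interval whose end points have opposite signs.
    Returns (approximate root, number of stages performed).
    """
    stages = 0
    while hi - lo > eps:  # assumed stopping rule: interval width
        step = (hi - lo) / (n + 1)
        points = [lo + k * step for k in range(1, n + 1)]
        values = [f(x) for x in points]  # one evaluation per "process"
        stages += 1
        new_lo, new_hi = lo, hi
        for x, fx in zip(points, values):
            if fx <= 0:
                new_lo = x  # root lies at or to the right of x
            else:
                new_hi = x  # root lies to the left of x
                break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2, stages
```

With n = 1 the sketch degenerates to the uniprocessor binary search; with n = 3 each stage shrinks the interval by a factor of four, as in Figure 2.1.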


For the measurements presented here the function we are evaluating is the normal
integral:

    F(x) = \int_{-\infty}^{x} \exp(-\tfrac{1}{2} t^2) \, dt - h        (2.1)


For x < 2.32 the following truncated power series was used to evaluate F(x):

    ( x + \frac{x^3}{3} + \frac{x^5}{3 \cdot 5} + \frac{x^7}{3 \cdot 5 \cdot 7} + \frac{x^9}{3 \cdot 5 \cdot 7 \cdot 9} + \cdots ) - h        (2.2)

and for larger x we used the continued fraction:

    1/( x + 1/( x + 2/( x + 3/( x + \cdots )))) - h        (2.3)
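One way the two expansions might be coded is sketched below. Several details are assumptions of this sketch rather than statements from the text: the usual density prefactor exp(-x^2/2)/sqrt(2*pi) multiplies both expansions, the series yields the area from 0 to x and the continued fraction the upper tail, x is non-negative, and a fixed count of 60 terms is used. The function name `normal_integral` is hypothetical, and the offset h is left to the caller.

```python
import math


def normal_integral(x, terms=60):
    """Lower-tail standard normal probability for x >= 0.

    Power series below 2.32, continued fraction at or above it.
    The prefactor, term count, and x >= 0 restriction are assumptions.
    """
    density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    if x < 2.32:
        # x + x^3/3 + x^5/(3*5) + x^7/(3*5*7) + ...
        term = x
        total = x
        for k in range(1, terms):
            term *= x * x / (2 * k + 1)
            total += term
        return 0.5 + density * total
    # 1/(x + 1/(x + 2/(x + 3/(x + ...)))), evaluated from the inside out
    frac = x
    for k in range(terms, 0, -1):
        frac = x + k / frac
    return 1.0 - density / frac
```

The two branches do different amounts of work, and the amount of work within each branch depends on x in a real implementation that stops when terms become negligible, which is the data dependence the text relies on.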


We selected this normal integral because it is an important transcendental function that
exhibits two characteristics important to our measurement studies: it requires an extensive
amount of computation, and the type and length of computation are data dependent.

In order to evaluate the performance of our implementations of the rootfinding algorithm
we first calculate the theoretical, or overhead-free, performance curves.

The basic cycle in the rootfinder is the independent evaluation of the function by each of
the cooperating processes and, upon finishing, the communication of each process with the
other processes by posting the results of its function evaluation. Notice that the new interval
is not located until all of the processes have posted their results*. When the last process
finishes its function evaluation it assumes the jobs of finding the new root-containing interval
and waking up all of the waiting processes. This basic cycle we call a STAGE.

Under ideal conditions the cooperating processes in the rootfinder would exhibit the
execution behavior found in Figure 2.2. Each process performs a function evaluation
independently. They all finish at the same instant and, after a very brief bookkeeping
calculation, perform a new F(x) calculation on an interval reduced by 1/(n+1). In practice, we
seldom find this to be the case. The fluctuations in performance stem from sources intrinsic
to the multiprocessor as well as the rootfinding program.


* The new interval can be located as soon as the sub-interval is bounded, but again we have
opted for a more straightforward algorithm in order to focus on the implementation issues.
3. Sources of Performance Fluctuation

3.1.  Introduction 

In this case study there are three distinct sources of performance fluctuation: the variation
in the amount of computation required to perform a function evaluation, the individual
hardware elements' performance characteristics, and the operating system. We will identify
the nature and measure the magnitude of each of these sources starting with the variation in
the F(x) calculation as it is the most straightforward of the three.

3.2.  The  Variation  in  the  F(x)  Calculation 

The elapsed time to perform a function evaluation is data dependent. The distribution of
the compute time is shaped approximately Normal as shown in Figure 3.1. The mean is about
100 milliseconds with almost an equal number of samples on each side of the mean*. Thus
we might model the expected finishing time for a process performing an F(x) calculation to be
a random variable with a Normal distribution. As other functions would have other compute
time distributions, we derive the theoretical performance for the constant and exponential
cases also.

Let the time taken by the i-th stage in the rootfinding procedure be the random variable T_i.
Since all of the processes are performing the same calculation, each process executes for a
random amount of time, t (see Figure 3.2), taken from some distribution. Because all of the
processes must finish their function evaluations before the new sub-interval is located,

    T_i = \max( t_1, t_2, t_3, \ldots, t_n )        (3.1)

From elementary order statistics the expected value of the largest order statistic in random
samples of n from a parent distribution with continuous strictly increasing cdf P(x) is

    E( x_{(n)} ) = \int_{-\infty}^{\infty} n x [ P(x) ]^{n-1} \, dP(x)        (3.2)

If we know nothing about the distribution of the t_i other than the mean u and standard
deviation s, the expected value of the largest order statistic, T_i, reduces to an upper
bound (equation 3.3).

* On an 11/20 processor.

Figure 3.2 Performance Degradation Due to Variation in the F(x) Compute Time

This bound can be replaced in the exponential case by the equality

    E( T_n ) = n u \sum_{j=0}^{n-1} \binom{n-1}{j} \frac{(-1)^j}{(j+1)^2}        (3.4)
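As a check on the exponential case, the sketch below evaluates the sum n * u * sum_{j=0}^{n-1} C(n-1, j) (-1)^j / (j+1)^2 exactly and compares it with u times the n-th harmonic number, which is the well-known mean of the largest of n independent exponential samples. The function names are hypothetical and exact rational arithmetic is used only to make the comparison unambiguous.

```python
from fractions import Fraction
from math import comb


def expected_max_exponential(n, u=1):
    """n * u * sum_{j=0}^{n-1} C(n-1, j) * (-1)^j / (j+1)^2, computed exactly."""
    total = sum(
        Fraction(comb(n - 1, j) * (-1) ** j, (j + 1) ** 2) for j in range(n)
    )
    return n * u * total


def harmonic(n):
    """1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))
```

For nine processes the expected stage time is about 2.83 u, so even with no overhead a stage takes almost three times as long as a single exponential evaluation.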

For the Normal case we consult Teichroew's [1956] tables for the expected value of the
largest order statistic drawn from the N(0,1) distribution.

When the expected value of the compute time is a constant, equation 3.3 is replaced by the
simple equality E(T_i) = u.

If we are interested in the performance speedups obtained when we put more processes
to work finding roots, we need to estimate the average time to locate a root as a function of
the number of processes. Since every iteration in the rootfinding procedure reduces the
interval of uncertainty, L, by a factor of n+1, it takes Ceiling(Log_{n+1} L) iterations to locate the
root in a bounded interval of length L. Thus in our example let R_i denote the number of
iterations necessary to arrive within ε of the root using i processes. For our choice of ε,
R = {54, 34, 27, 23, 21, 19, 18, 17, 16, 16, 15, 15, ...} iterations. Notice that it takes the same
number of iterations to locate the root using nine and ten or eleven and twelve processes.
This is because the number of iterations must be an integer. Thus, there is little to be gained
by incorporating many processes in the procedure. In this study the maximum number of
processes we will use is nine.
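The iteration counts can be reproduced with a few lines of integer arithmetic. The ratio of interval length to ε is not stated in the text; the value 10^16 used below is an assumption, chosen only because it yields the quoted sequence. Integer powers are used instead of floating-point logarithms so that exact powers such as 10^16 do not round the wrong way.

```python
def iterations(n, ratio):
    """Smallest k with (n + 1)**k >= ratio, i.e. Ceiling(Log_{n+1} ratio)."""
    k, shrink = 0, 1
    while shrink < ratio:
        shrink *= n + 1  # each stage narrows the interval by a factor n + 1
        k += 1
    return k


RATIO = 10 ** 16  # hypothetical L / epsilon, not given in the text
R = [iterations(n, RATIO) for n in range(1, 13)]
```

The plateaus in `R` (nine and ten processes, eleven and twelve processes) are the integer effect described above.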


We can estimate the runtime of the rootfinder to be the following:

    Runtime(n) = \sum_{k=1}^{R_n} T_k = R_n * E( T_n )        (3.5)

Often we will be interested in the speedup achieved through parallelism. We will use the
following formula to calculate speedup:

    Speedup(n) = \frac{Runtime(1)}{Runtime(n)}        (3.6)
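A small worked example of the runtime and speedup formulas, for the two distributions whose expected stage time has a closed form. The iteration counts are those quoted in the text; the helper names are hypothetical, and the exponential stage time uses the standard result that the mean of the largest of n exponential samples is u(1 + 1/2 + ... + 1/n).

```python
from fractions import Fraction

# Iteration counts R_1 .. R_9 quoted in the text.
R = [54, 34, 27, 23, 21, 19, 18, 17, 16]


def stage_time_constant(n, u=1):
    """Every evaluation takes exactly u, so the slowest also takes u."""
    return Fraction(u)


def stage_time_exponential(n, u=1):
    """Mean of the largest of n exponential samples with mean u."""
    return u * sum(Fraction(1, k) for k in range(1, n + 1))


def runtime(n, stage_time):
    """Runtime(n) = R_n * E(T_n)."""
    return R[n - 1] * stage_time(n)


def speedup(n, stage_time):
    """Speedup(n) = Runtime(1) / Runtime(n)."""
    return runtime(1, stage_time) / runtime(n, stage_time)
```

Under these assumptions nine processes give a speedup of 54/16 = 3.375 for constant compute times, but only about 1.19 for exponentially distributed compute times.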


Figure 3.3 is a plot of the speedup vs. number of processes for the following three
distributions:

    Distribution    Mean    Standard Deviation
    Constant        1000    0
    Normal          1000    278
    Exponential     1000    1000

[Figure 3.3: speedup vs. number of processes]
The glitches in the curves are a result of the Ceiling function in the equation for the
number of iterations to perform. Because the number of iterations must be an integer value,
the curves are not smooth.

This figure contains calculated no-overhead performance curves for three sample F(x)
distributions with standard deviations ranging from 0 to u. The performance range is from
negligible speedup when the compute time for the function evaluation is exponentially
distributed to more than a factor of 3.3 speedup for nine processes when the distribution of
the F(x) calculation is a constant. The Normal curve between these extremes closely
approximates the actual F(x) distribution and is included for comparison.

Another way to view this data is to plot speedup for the nine processes case vs. the ratio
standard deviation/mean as was done in Figure 3.4. This figure very clearly shows the impact
of the variance on the performance of the rootfinding procedure. When the coefficient of
variation is much greater than one, no speedup can be obtained by incorporating multiple
processes in the rootfinding task.

Now we compare the calculated no-overhead performance of the rootfinder to measured
data observed on the machine. In order to measure performance as a function of the
distribution of the F(x) compute time a synthetic rootfinder was developed because we did
not want to limit our investigations to particular distributions too early in the experiment. The
nature of the calculation was still the real function evaluation; however, the length of time
spent computing was adjustable to reflect the distribution under consideration.

Figure 3.5 graphs performance in terms of elapsed time as a function of the number of
processes for three distributions of the F(x) calculation. In each case we compare theoretical
performance to measured data. Since the means of the three distributions were not identical
the data points for the single process instantiation do not coincide. Thus in this graph
comparisons across distributions can only be relative approximations. What is important here
is how close the measured curves are to their theoretical curves.

For each single process instantiation the measured and theoretical curves are far apart.
This is because any perturbation the process experiences will be additive and will lengthen
the basic cycle time.

As we incorporate more processes the constant distribution diverges the most from the
theoretical while the exponential diverges the least. The reason for this behavior is that
perturbations experienced by the processes will tend to increase the variance of the
underlying distribution. However, a small change in the variance of the constant distribution
will be a much larger relative change than a similar change to the exponential distribution.

[Figure 3.4: speedup for nine processes vs. the ratio standard deviation/mean]

[Figure 3.5: elapsed time (sec.) vs. number of processes]

THE IMPLEMENTATION AND EVALUATION OF A PARALLEL ALGORITHM ON CMMP

That the observed data doesn't agree closely with the calculated curves is evidence that
there are other influences on performance besides the distribution of the compute time. In
the following sections we discuss measurements that uncover the other factors influencing
performance.

3.3. The Variation in Performance of Individual Hardware Elements

The fluctuations in performance caused by the hardware will always be present because
Hydra allocates C.mmp's resources dynamically. While a user cannot accurately predict the
exact performance of his processes, he can estimate the magnitude of the fluctuation in
performance by measuring the variation in the performance of the individual hardware
elements.

3.3.1. Processor Related Variations

C.mmp is a multiprocessor constructed from PDP-11 model 40 and model 20 minicomputers.
In Table 3.1 we have summarized the basic performance difference between the processors
by comparing their execution of the F(x) calculation without the presence of Hydra. Each
processor performed the calculation 100 times in the same memory port. The number of
MSYN's* was counted and the elapsed time measured. These figures appear in the first and
second columns. The third column of figures is the processor speed relative to Pc[0].

* MSYN is the DEC name for the signal that indicates a request is being made for the
Unibus. Since only the processor is making requests the number of MSYNs is the number
of memory requests made by the processor.
    Pc    Model    Elapsed Time (sec.)    kMsyn's/sec    Relative to Pc[0]
     0    11/20    15.559                 443.3          1.000
     1    11/40    10.413                 662.4          1.494
     2    11/40     9.985                 690.8          1.558
     3    11/40     9.745                 707.8          1.596
     4    11/20    16.144                 427.2          0.963
     5    11/40    10.060                 685.7          1.546
     6    11/40    10.238                 673.7          1.519
     7    11/40     9.829                 701.8          1.582
     8    11/20    14.867                 463.9          1.046
     9    11/40    10.022                 688.3          1.552
    10    11/40    10.173                 678.0          1.529
    11    11/40    10.001                 689.7          1.555
    12    11/40    10.129                 681.0          1.536
    13    11/40    10.005                 689.4          1.555
    14    11/20    14.965                 460.9          1.039
    15    11/20    14.999                 459.9          1.037

Table 3.1 Speed Variations Among C.mmp's Processors

Naturally, a process on an 11/40 should execute faster than a similar process on an 11/20.
Notice that even among processors of the same type there can be more than a 5% difference
in speed.
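The spread within each processor model can be read directly off the elapsed times in Table 3.1. The short sketch below groups the times by model and reports how much longer the slowest processor of a model takes than the fastest; the helper name `spread_percent` is hypothetical.

```python
# Elapsed times (sec.) for 100 F(x) evaluations, grouped by model, from Table 3.1.
ELAPSED = {
    "11/20": [15.559, 16.144, 14.867, 14.965, 14.999],
    "11/40": [10.413, 9.985, 9.745, 10.060, 10.238, 9.829,
              10.022, 10.173, 10.001, 10.129, 10.005],
}


def spread_percent(times):
    """Percentage by which the slowest run exceeds the fastest."""
    return (max(times) / min(times) - 1.0) * 100.0


SPREADS = {model: spread_percent(times) for model, times in ELAPSED.items()}
```

By this measure the 11/40's differ by a little under 7% and the 11/20's by a little under 9%, both above the 5% noted in the text.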

Because there are two types of processors, the strategy of dynamically assigning
processes to processors is complex. It is not sufficient to schedule a "ready" process to the
best processor available. The following scenario clearly demonstrates why.

Suppose that the rootfinding processes are performing their function evaluations and are
finishing at random times. After several have finished one would expect to find some idle
11/40's and computing 11/20's*. A good scheduler should handle its resources better. The
idle 11/40's should "steal" the processes computing on the 11/20's. Naturally, this
philosophy assumes that a context swap can be performed quickly. This process stealing
philosophy is the scheduling policy on C.mmp.


* During the course of our study the number of processors running in the system varied
day to day. The processor configuration during the experiment with the synthetic rootfinder
was 10 PDP-11/40's and 3 PDP-11/20's. Since we never used more than nine processors to
perform the F(x) calculation all of our processes ran exclusively on the 11/40's. However,
the problem is real. If we could have incorporated more than ten processes into the
rootfinding procedure we would have had to deal with it. Later experiments in this paper
measure the impact of the non-homogeneous processor configuration as the number of
available 11/40's frequently was less than nine.
3.3.2. Memory Related Variations

3.3.2.1. Technology Differences

C.mmp's centrally located primary memory is also a source of fluctuation in performance.
The memory is divided into 16 modules, or banks. Each bank can service memory requests
independently. However, the relative speeds of the banks are different because they contain
different types of memory. At the time of this study 5 banks contained semiconductor
memory and 11 contained magnetic cores. Table 3.2 summarizes the speed differences of the
memory banks. In this experiment Pc[15] performed the F(x) calculation 100 times in each
memory bank. The elapsed times appear in the table.


    Mp    Technology       Time (sec.)    kMsyn's/sec    Relative to Mp[0]
     0    core             15.243         452.5          1.000
     1    core             14.943         461.6          1.020
     2    core             15.127         456.0          1.007
     3    core             14.999         459.9          1.016
     4    core             15.087         457.2          1.010
     5    semiconductor    15.950         432.4          0.955
     6    core             15.272         451.6          0.998
     7    core             15.402         447.8          0.989
     8    semiconductor    15.887         434.2          0.959
     9    semiconductor    15.858         434.9          0.961
    10    semiconductor    15.860         434.9          0.961
    11    semiconductor    15.855         435.0          0.961
    12    core             15.070         457.7          1.011
    13    core             15.155         455.1          1.005
    14    core             15.034         458.8          1.013
    15    core             15.013         459.4          1.015

Table 3.2 Speed Variation among C.mmp's Memory Banks


Even among memory banks of the same technology, speed varies. These variations are
small, however, and are caused primarily by variations in the length of cable connecting a
memory bank to the crosspoint switch and in the timing circuitry for individual memory
modules.

3.3.2.2.  Memory  Bandwidth  and  Memory  Interference 

The previous experiments show the rates at which individual processors and memories can
communicate. Another important characteristic is the maximum bandwidth of a memory bank.
This rate will determine how many processors can compete for cycles in a single memory
bank before the bank is saturated with requests. In this experiment all of the processors
simultaneously executed the tight loop in the same memory bank. Two banks of different
types were chosen to be representative of their respective technologies.

The results in Table 3.3 indicate that performance degradation will occur if more than two
or three processors are competing for cycles in a memory bank. This implies that sharing
code, a common practice to conserve memory space, will result in slower execution.

    Semiconductor    1.49*10^6 memory refs/sec.
    Core             1.71*10^6 memory refs/sec.

Table 3.3 Maximum Memory Bandwidth
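The "two or three processors" figure can be checked with a rough division of bank bandwidth by per-processor request rate. The sketch below takes the bank bandwidths from Table 3.3 and, as an assumption, uses the request rates measured for Pc[0] (an 11/20) and Pc[3] (the fastest 11/40) in Table 3.1; those rates were measured while running the F(x) calculation rather than the tight loop, so the result is only indicative.

```python
# Maximum bank bandwidth in memory references per second (Table 3.3).
BANDWIDTH = {"semiconductor": 1.49e6, "core": 1.71e6}

# Per-processor request rates in references per second (Table 3.1),
# assumed here to be representative of each model: Pc[0] and Pc[3].
RATE = {"11/20": 443.3e3, "11/40": 707.8e3}


def processors_to_saturate(bank, model):
    """Number of processors of one model whose combined demand fills a bank."""
    return BANDWIDTH[bank] / RATE[model]
```

On these numbers a semiconductor bank is saturated by roughly two 11/40's or a little over three 11/20's.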

In tables 3.4 through 3.6 we illustrate the performance degradation that results from
sharing code. All of the measurements were performed on Pc[0]. In each case 100,000 total
cycles were sampled. The first column, Memory Cycle Length, is the elapsed time from MSYN
to SSYN*, a complete memory cycle.

Table 3.4 is the control sample where we monitored the memory accesses while the system
was idle. Although the vast majority of cycles were in the 500 ns. to 1 us. range there were
some cycles that were greater than 14 us. This is because a processor that doesn't have a
process to execute runs a task called the "idle job". The idle job consists of a WAIT
instruction followed by the code that checks to see if there is a process to execute. This
piece of code contains a critical section guarded by a mutual exclusion busy-wait loop. Since
all of the processors are sharing this code and trying to gain exclusive access to the critical
section, a great deal of memory contention occurs and the memory cycle lengths grow longer.
We will use this table to compare the performance of the rootfinding processes when they
execute from one common code page and when they each have a private code page.

Table 3.5 contains the results for when each of the processes executes from a private
code page. Comparing this table to 3.4 we make two observations: while the average
memory cycle length has increased slightly, relatively little difference exists between the two
tables; the one category where a noticeable change does occur is the long (> 5.0 us.) cycles.
Less than half as many long cycles now occur because the processors are kept busy
executing the rootfinding processes.

* SSYN is the DEC name for the signal that indicates the completion of a bus transfer. It is
the signal the memory module uses to tell the processor that the memory access is
completed.

Compare these two tables to the results in table 3.6 where all of the processes share one
common code page. Again we make two observations: the average memory cycle length has
dramatically increased by 300%; more important still is that the percentage of long cycles (>
5.0 us.) has increased from .086% in table 3.4 to 15.6%, over two and one-half orders of
magnitude more. This degradation in the basic cycle time will offset and eventually reverse
speedup obtained by incorporating multiple processes in the rootfinding procedure.


    MEMORY CYCLE
    LENGTH (us.)      READ    READ-PAUSE    WRITE    WRITE-BYTE
    0 - 0.5              0             0        0             0
    0.5 - 1.0        65652          7787    14089           902
    1.0 - 2.0         9470          1926        8             0
    2.0 - 5.0           63             6        2             0
    5.0 - 14.0          63             6       10             0
    14.0 - 50.0          5             2        0             0
    > 50.0               0             0        0             0

Table 3.4 Histogram for Idle System

    MEMORY CYCLE
    LENGTH (us.)      READ    READ-PAUSE    WRITE    WRITE-BYTE
    0 - 0.5              0             0        0             0
    0.5 - 1.0        65827          7461    11024           822
    1.0 - 2.0        12705          1133       38             0
    2.0 - 5.0          894            54       10             0
    5.0 - 14.0          28             3        0             0
    14.0 - 50.0          1             0        0             0
    > 50.0               0             0        0             0

Table 3.5 Histogram with Private Code Pages

    MEMORY CYCLE
    LENGTH (us.)      READ    READ-PAUSE    WRITE    WRITE-BYTE
    0 - 0.5              0             0        0             0
    0.5 - 1.0        52784          6504     9404           761
    1.0 - 2.0        10810           689      102             0
    2.0 - 5.0         3059           201       84             0
    5.0 - 14.0       14291           843      287             0
    14.0 - 50.0        174             4        3             0
    > 50.0               0             0        0             0

Table 3.6 Histogram with Common Code Page
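The long-cycle percentages quoted for the three histograms can be recomputed from the table entries. In the sketch below each row holds the lower bound of its bin in microseconds followed by the READ, READ-PAUSE, WRITE, and WRITE-BYTE counts; empty bins are omitted, and the percentage is taken over the sum of the listed counts. The names are hypothetical.

```python
# (bin lower bound in us., READ, READ-PAUSE, WRITE, WRITE-BYTE)
TABLE_3_4 = [(0.5, 65652, 7787, 14089, 902), (1.0, 9470, 1926, 8, 0),
             (2.0, 63, 6, 2, 0), (5.0, 63, 6, 10, 0), (14.0, 5, 2, 0, 0)]
TABLE_3_5 = [(0.5, 65827, 7461, 11024, 822), (1.0, 12705, 1133, 38, 0),
             (2.0, 894, 54, 10, 0), (5.0, 28, 3, 0, 0), (14.0, 1, 0, 0, 0)]
TABLE_3_6 = [(0.5, 52784, 6504, 9404, 761), (1.0, 10810, 689, 102, 0),
             (2.0, 3059, 201, 84, 0), (5.0, 14291, 843, 287, 0),
             (14.0, 174, 4, 3, 0)]


def long_cycle_percent(table, threshold=5.0):
    """Percentage of sampled cycles at least `threshold` microseconds long."""
    total = sum(sum(row[1:]) for row in table)
    long_cycles = sum(sum(row[1:]) for row in table if row[0] >= threshold)
    return 100.0 * long_cycles / total
```

This gives about 0.086% for the idle system, 0.032% with private code pages, and 15.6% with a common code page.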


Figure 3.6 captures the impact of the finite memory bandwidth problem on the rootfinding
procedure. We have graphed the elapsed time to locate 50 roots versus the number of
processes for two implementations of the rootfinding procedure. The dashed curve results
when a single copy of the code page is shared. The solid curve is the performance when the
cooperating processes each have a copy of the code in a private memory bank.


[Figure 3.6: elapsed time (sec.) to locate 50 roots versus number of processes]



This graph also provides some insight into the speed versus space tradeoff. If we
compare the speedup over the single process instantiation for both the shared and
no-sharing versions of the rootfinder, we find that the no-sharing version has a maximum
speedup of 2.60 using nine processes, while the shared version's performance peaks at 1.53
using three processes. Neglecting the single global data page, we have achieved a 170%
increase in speed at the cost of a 300% increase in size. In this study memory is plentiful and
we squander space for speed.

One solution to the speed vs. size tradeoff is to interleave the central memory on the low
order bits rather than the high order bits. This solution would tend to scatter memory
requests more evenly across the 16 banks. To maintain availability it might be necessary to
organize the store as four banks of 4-way interleaved memory. A second solution is to give
each processor a cache to work with. This is the solution currently being implemented on
C.mmp.
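
The difference between the two interleaving schemes can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the 20-bit address width and the function names are hypothetical and are not taken from the C.mmp hardware description; only the count of 16 banks comes from the text.

```python
NUM_BANKS = 16        # from the text: 16 memory banks
BANK_BITS = 4         # log2(NUM_BANKS)
ADDRESS_BITS = 20     # hypothetical physical address width, for illustration only


def bank_high_order(addr):
    """Bank chosen by the top address bits: a contiguous page lives in one bank."""
    return addr >> (ADDRESS_BITS - BANK_BITS)


def bank_low_order(addr):
    """Bank chosen by the bottom address bits: neighbouring words rotate over banks."""
    return addr & (NUM_BANKS - 1)


def banks_touched(select, start, length):
    """Set of banks referenced by `length` consecutive addresses from `start`."""
    return {select(a) for a in range(start, start + length)}


code_start = 0x10000  # hypothetical start of a shared code page
high = banks_touched(bank_high_order, code_start, 64)
low = banks_touched(bank_low_order, code_start, 64)
```

Under these assumptions `high` contains a single bank, so every process fetching the shared code competes for it, while `low` contains all 16 banks.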

3.4.  Operating  System  Related  Performance  Fluctuations 

3.4.1.  Introduction 

The operating system also perturbs the performance of the rootfinding procedure.
Although C.mmp was intended to be a multi-user multi-programming facility, it is possible to
use the machine in a dedicated single user mode. In this mode of operation the user can
minimize any interference from Hydra by simply not doing anything that requires operating
system assistance. Most of the measurements in this study were performed in this way.
However, certain functions, i.e. scheduling of processes and I/O interrupts, must be performed
by Hydra and cannot be avoided. The contribution to performance fluctuation from these
basic operating system functions is measured and discussed in the following sections.

3.4.2.  The  Kernel  Tracer 


The Kernel Tracer is a software monitor that can obtain information about significant
activity on C.mmp under the Hydra Operating System. Since it is a software monitor, the
Tracer does perturb the timing data it attempts to measure. However, this can be
compensated for in the post-processor software.

The Tracer can monitor such things as: context swaps (this occurs when a processor
changes from executing one process to executing another), semaphore activity, process starts
and stops, O.S. requests (Kernel Calls) and a multitude of other events. Events defined by
user programs may also be traced.

The data is collected in real time and later post-processed offline. There are numerous
post-processing programs that produce various forms of output: by process or processor
dumps, time-line execution histories, and various statistical analysis packages.

All of the Tracer data that follows is in the form of a processor time-line execution history.
We use various symbols in the trace to encode events in order to compact the data. Table 3.7
contains these symbols and their meanings. Each row of the trace represents the activity on a
processor. The time in seconds appears along the bottom edge. We will discuss in detail the
first trace, which captures the impact of I/O interrupts on performance.


3.4.3.  I/O  Devices  and  Interrupts 

Random interrupts from I/O devices and processors contribute to performance fluctuations
in the rootfinder processes. Unlike the memory, I/O devices are not centrally located and
accessible through an n x m crosspoint switch. Devices are associated with a particular
processor. Thus, for example, a read or write from a disk on Pc[0]'s Unibus must be
performed by processor 0 regardless of which processor initiated the request. Since
interrupts are serviced by stealing cycles from the currently executing process, large
fluctuations in compute times can be found for processes running on processors with I/O
devices.

In Figure 3.7 interrupts associated with I/O perturb the performance of the rootfinding
processes. C.mmp's processor configuration during this trace was Pc[0, 3, 4, 5, 6, 7, 8, 9, 11,
12, and 13]; these appear from bottom to top as rows of the trace. Pc[0, 4, and 8] are
PDP-11/20s and the rest are PDP-11/40s. Processes (35, 43-50) are the nine rootfinding
processes. Process 29 and the DAEMON are other processes that happened to be awake at
the time. These two processes are doing things that cause a substantial amount of I/O. The
following discussion describes how this I/O activity perturbs the rootfinding processes.

SYMBOL            MEANING

PROCESS N         PROCESS #N IS RUNNING
- CSW -           A CONTEXT SWAP
IOT #X            SPECIAL TYPE OF KERNEL KALL
KALL #X           KERNEL KALL #X
RET X             RETURN VALUE FROM A KERNEL KALL
[                 START OF AN INTERRUPT AT LEVEL N
(shaded)          INTERRUPT SERVICE ROUTINE EXECUTION
]                 END OF AN INTERRUPT
EVENT X           USER DEFINED EVENT X OCCURS
P                 P OPERATION ON A SEMAPHORE
V                 V OPERATION ON A SEMAPHORE
DAEMON            OPERATING SYSTEM PROCESS
(blank)           IDLE TIME

Table 3.7 Tracer Symbols

A previous iteration finishes at 0.612 seconds into the trace. Process 50, P(50), on Pc[11]
was the last to finish its calculation (the activity on Pc[6] is P(29)) and begins to wake its
sleeping companions by unlocking their semaphores. One by one the processes wake up and
begin to perform the next iteration. P(50) finishes waking up all the processes (P(49) was
the last to wake up, at .641) and begins its own function evaluation. One by one the
processes finish their calculations and post their results, after which they "P" their
semaphores and wait for the beginning of the next iteration. When they block on the
semaphore they are removed from the processor (e.g. CSW for P(45) on Pc[5] at .700).
Notice that four of the processors have large chunks of time shaded between brackets. This
denotes an interrupt service routine performing I/O to a device on that Pc's Unibus.
Interrupt service routines can consume between 1 and 15 milliseconds of time. This causes
the rootfinding process on that Pc to arrive at the synchronization point late, thus
lengthening the STAGE time.

For example, P(49) on Pc[8] is interrupted at .681 for 13 milliseconds and then again at
.707 for 4 more milliseconds. Notice, however, that P(49) switches from Pc[8] to Pc[6] at .709
and finishes its function evaluation at .728 uninterrupted. Since it is the last process to finish,
it assumes the jobs of finding the new root-containing subinterval and dispatching the
processes to perform the next iteration.

In this example the interrupted process was delayed enough to become the last process to
finish, thus lengthening the STAGE time. This is not always the case. For example, P(46) on
Pc[13] was also interrupted during its function evaluation for approximately 21 milliseconds,
yet it was not the last to finish and did not cause the STAGE time to lengthen. This is
another advantage the multiprocess implementation of the rootfinding procedure has over its
uniprocess counterpart. In the single process instantiation the interrupt time is additive and
each occurrence lengthens the iteration. In the multiprocess version only the interrupt time
associated with the last process to finish is additive.
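
The contrast between additive and non-additive interrupt time can be written down as a small Python sketch. The model is a simplification we assume for illustration (each process's finish time is its compute time plus its interrupt time, and the STAGE ends when the slowest process finishes); the function names and the sample numbers are hypothetical, not trace data.

```python
def uniprocess_time(compute_ms, interrupt_ms):
    """Single process: every evaluation and every interrupt adds to the iteration."""
    return sum(compute_ms) + sum(interrupt_ms)


def stage_time(compute_ms, interrupt_ms):
    """Multiprocess: the STAGE ends when the slowest (compute + interrupt) finishes."""
    return max(c + i for c, i in zip(compute_ms, interrupt_ms))


compute = [70, 60, 40]                        # hypothetical evaluation times (ms)

hidden = stage_time(compute, [0, 0, 21])      # interrupt hits a process with slack
exposed = stage_time(compute, [17, 0, 0])     # interrupt hits the slowest process
serial = uniprocess_time(compute, [0, 0, 21])
```

In this sketch the 21 ms interrupt is absorbed (`hidden` stays at 70 ms), whereas an interrupt on the slowest process lengthens the stage (`exposed` is 87 ms), and the single process version always pays the full interrupt time (`serial` is 191 ms).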

3.4.4.  Kernel  Processes  and  Special  Functions 


Operating system requests are frequently handled by special high priority Kernel
processes and as such perturb the cooperating rootfinder processes by stealing processors.
Of particular interest are the processes that perform scheduling. Because synchronization of
communicating processes can involve rescheduling the processes, the special scheduler
processes can become bottlenecks causing performance degradations.

During the trace of Figure 3.8, C.mmp's processor configuration was Pc[0, 2, 4, 5, 6, 7, 8, 9,
10, 11, 12, and 13]. Of these, 4 and 8 are 11/20's (so is Pc[0]) and are the third and seventh
blank columns with no execution history. Since enough processors of the preferred (11/40)
type were available, the 11/20's were never used. Similarly Pc[12] was not needed.

In  this  trace  processes  (18,  19,  20,  21,  22)  are  rootfinding  processes.  Processes  1 and  2 
are  Kernel  scheduling  processes,  and  process  14  is  the  Tracer  process. 

P(22) on Pc[10], the last process to finish the previous function evaluation, initializes the
necessary parameters for the next iteration. At 285 ms. into the trace (.285) it begins to V
its sleeping companion processes, and at .309 it begins its own function evaluation (event
#372).

Meanwhile P(2) on Pc[6] (a scheduling process) wakes up (CSW at .293) and begins to perform
the task of actually waking up the processes that process 22 has just V-ed. It is a relatively
painful task involving several semaphore operations and several Kernel calls per process.
Finally process 18 (the first to be V-ed) wakes up and begins its function evaluation at .348,
approximately 60 ms. after process 22 performed the V operation.

To expedite the costly wake-up procedure, processes 1 and 2 (the scheduling processes)
cooperate to start and stop the rootfinding processes. Even so, by the time they get
around to starting process 21, the last process that is to wake up, three of the other
rootfinding processes have already finished their function evaluations and have gone back to
sleep (P followed by CSW). A full 130 ms. have transpired since process 22 performed the V
to wake process 21.

Another side-effect related to the O.S. that can affect the performance of cooperating
processes is the round-robin scheduling of processes under Hydra. This traditional policy is
implemented using the notion of "time-sliced" intervals of execution to provide equal service
to all tasks. Occasionally a process exhausts its time slices and must ask for more. This
request can take more than 150 milliseconds to execute. Whether or not the time-slice end
anomaly will perturb the performance of the cooperating processes depends upon the
average duration of the function evaluation and the frequency of the time-slice end condition.
In this study a process must consume 10 one-half second slices before encountering the
time-slice end condition.

Figure 3.9 is the distribution of the elapsed time to perform an F(x) calculation in the
presence of Hydra. The long tail in the distribution is a result of the time-slice end condition
occurring for the process performing the function evaluation. Compare this histogram to the
one in Figure 3.1.

3.5.  Summary 

The sources of performance fluctuation we have discussed can be classified into one of
three types: application, hardware, or operating system related. In the table below we rank
the sources of perturbation by their potential for causing performance fluctuations. Each
source is measured and the observed range calculated by dividing the maximum measurement
by the minimum observed value. The larger the range, the more potential for performance
fluctuation.

In the next section we eliminate several sources of perturbation in order to measure the
performance of various synchronization primitives. We do this by carefully selecting
processors and memory banks to execute the rootfinding program.


Rank   Type               Source               Measurement               Range

1      Application        F(x) Calculation     Function Evaluation       1 : 3.4

2      Hardware           Memory Contention    Average Cycle Length      1 : 3.0

3      Operating System   Kernel Processes     Bottlenecking of          1 : 2.8
                                               Scheduling Processes

4      Hardware           Processors           Speed                     1 : 1.6

5      Operating System   I/O Devices and      F(x) Calculation          1 : 1.3
                          Interrupts           Degradation

6      Hardware           Memories             Speed                     1 : 1.07

Table 3.8 The Sources of Performance Perturbation

4. The Effect of Synchronization on Performance

4.1.  Introduction 

Newell and Robertson [1975] identified seven programming issues for multiprocessor
computer systems. Since synchronization of cooperating processes is a fundamental problem
in the implementation of a parallel algorithm, we will measure the performance and discuss the
tradeoffs of the various synchronization mechanisms available to the C.mmp user.

Up until now we have used a very simple form of "busy-waiting" loop to synchronize the
cooperating processes. Although synchronization using this method is extremely fast,
undesirable side effects can cause serious performance problems. We will discuss several
alternative synchronization mechanisms, describe their functionality and any interesting side
effects, compare their performance in the context of the rootfinding algorithm, and conclude
by presenting the range of usefulness for each.

4.2.  Description  of  Synchronization  Primitives 

We first examine the nature of the synchronization problem for the rootfinding processes.
In Figure 4.1 we present a more detailed view of the STAGE time and in particular focus on
the mechanics of synchronization. The segment labeled FIND is the time spent locating the
new root-containing sub-interval. The V's correspond to waking up each of the rootfinding
processes. One quickly notices that the overhead of synchronization can be a significant part
of the STAGE time in certain instances. Because we have used a spin lock, a form of busy
waiting, to synchronize the processes, the overhead of synchronization has been negligible.
However, it is not always desirable to implement synchronization with this mechanism.


Figure 4.1 A Detailed View of the STAGE Time


4.2.1. The Spin Lock

Of the three synchronization primitives considered in this study, the spin lock is the most
rudimentary. This primitive is actually implemented independently of any Hydra support and
is only a tight loop in which the process continually tests a semaphore until it can set it
successfully. The P and V operations are the following PDP-11 code sequences:

    P:  CMP  SEMAPHORE, #1      ;SEMAPHORE = 1 ?
        BNE  P                  ;loop until it is = 1
        DEC  SEMAPHORE          ;decrement SEMAPHORE
        BNE  P                  ;if SEMAPHORE neq 0 then go to P

    V:  MOV  #1, SEMAPHORE      ;reset SEMAPHORE = 1

The repeated polling of the semaphore, although extremely fast, has two very nasty
characteristics.


The first is that when the process completes its function evaluation and starts to poll the
semaphore while waiting for its counterparts to finish, the processor is not free to perform
useful work.


The second major drawback is that the polling process consumes many cycles in the
memory bank that contains the semaphore. As more processes finish their function evaluations
and begin to poll the semaphore, the bandwidth of the memory bank is quickly consumed.
The process that has its code page located in the bank with the semaphore will be competing
for cycles with many "busy" processors. This second problem can be circumvented by
inserting a tiny delay loop in the semaphore code, i.e., decrementing a register to zero before
checking the semaphore. This delay will decrease the frequency of memory requests in the
semaphore memory bank, but not slow the synchronization primitive appreciably. However,
the primary problem still remains: a "spinning" process prevents a processor from doing
useful work.
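
The busy-waiting loop with a delay between polls can be sketched in Python. This is an illustration, not a translation of the PDP-11 sequence: a non-blocking `threading.Lock.acquire` stands in for the test-and-decrement of the SEMAPHORE word, and the class name, the `delay_count` parameter and the worker harness are all hypothetical.

```python
import threading


class SpinLock:
    """Busy-waiting lock with a small countdown between polls."""

    def __init__(self, delay_count=100):
        self._flag = threading.Lock()      # stands in for the SEMAPHORE word
        self._delay_count = delay_count    # length of the delay loop (assumed value)

    def P(self):
        # Poll until the flag can be taken; the caller never gives up its thread.
        while not self._flag.acquire(blocking=False):
            n = self._delay_count
            while n > 0:                   # "decrement a register to zero"
                n -= 1

    def V(self):
        self._flag.release()


def run_workers(num_workers=4, increments=200):
    """Each worker increments a shared counter inside the P/V critical section."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.P()
            counter[0] += 1
            lock.V()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


total = run_workers()
```

The countdown reduces how often a waiting worker touches the shared flag, but, as noted above, a waiting worker still occupies its thread of execution for the whole wait.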


4.2.2.  The  Kernel  Semaphore 

The Kernel semaphore (K-SEM) is implemented by the Hydra operating system. It is the
low level synchronization mechanism used by system processes. When a process blocks or
wakes up, a state change for that process is made inside the Kernel. Because it is
implemented within the domain of the Kernel, the user invokes operations on the semaphore (P
and V) by issuing kernel calls. If the process blocks while trying to P the semaphore, the
Kernel swaps the process from the processor and places the process in the semaphore's
blocked-queue, where it remains until another process V's the semaphore. When the process
can proceed again, it is swapped back onto an available processor and continues execution
from the point where it was blocked. The important attributes of the Kernel semaphore are:

- A blocked process is swapped from a processor.

- When a process blocks its pages are kept in primary memory. This ensures that
the process will quickly resume execution when it is swapped back onto a
processor.

- The Kernel semaphore is approximately two orders of magnitude slower than the
spin lock.
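
The blocking behaviour described above can be modelled with a short Python sketch. It is a toy model under our own assumptions, not Hydra's implementation: the class and method names are hypothetical, process identifiers are plain integers, and the first-in first-out order of the blocked-queue is assumed rather than stated in the text.

```python
from collections import deque


class KernelSemaphoreModel:
    """Toy model of a semaphore that queues blocked processes instead of spinning."""

    def __init__(self, value=1):
        self.value = value
        self.blocked = deque()   # the semaphore's blocked-queue (FIFO order assumed)

    def P(self, pid):
        """Return True if `pid` may proceed; False if it is swapped out and queued."""
        if self.value > 0:
            self.value -= 1
            return True
        self.blocked.append(pid)  # the processor is now free for other work
        return False

    def V(self):
        """Release the semaphore; return the pid that is woken, or None."""
        if self.blocked:
            return self.blocked.popleft()  # woken process resumes where it blocked
        self.value += 1
        return None


sem = KernelSemaphoreModel(value=1)
first = sem.P(18)      # proceeds
second = sem.P(19)     # blocks and is queued
woken = sem.V()        # wakes process 19
idle_release = sem.V() # nobody waiting, semaphore becomes free again
```

The point of the model is the contrast with the spin lock: a process that fails to P gives up its processor rather than polling.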


4.2.3.  The  Policy  Module  Semaphore 

The policy module semaphore (P-SEM) is implemented by the scheduling subsystem called
the Policy Module (PM). This primitive is intended as the user's primary mechanism for
performing synchronization.

Because  the  synchronization  is  performed  within  the  context  of  a system  process,  more 
flexibility  is  available  in  handling  a blocking/waking  process.  The  first  policy  that  was 
adopted  to  handle  blocking/waking  processes  was  the  following: 

- Two  PM  processes  would  cooperate  to  perform  synchronization  operations  for 
users;  one  would  start  and  stop  processes  and  the  other  would  handle 
communication  between  the  Kernel  and  user. 

- When a process blocked on a semaphore it would be context swapped from the
processor.

- Any  ’dirty’  pages  belonging  to  the  process  would  be  updated  on  secondary 
storage. 

- When  a process  was  to  wake  up  it  would  be  restarted  by  one  of  the  PM 
processes  after  all  the  swapped  out  pages  belonging  to  the  process  were 
brought  back  in  to  central  memory. 

This  policy  has  evolved  into  a much  faster  arrangement  of  multiple  processes  in  the  current 
version  of  the  PM. 

One modification to the PM that was found to improve performance substantially was to
delay the updating of a process' dirty pages onto secondary storage. Often a process is
blocked for very short amounts of time and will quickly resume execution after only several
milliseconds of waiting for a certain condition to be true. However, when a page is to be
updated onto secondary storage it is written onto one of several IMS™ fixed head disks,
which will take at least 32 milliseconds per page. The swapping disks revolve once every
16.67 milliseconds. It takes two revolutions to update a page: one to write it out and the
second to perform a read-check operation to validate the copy. Thus it is quite possible for
a process to spend most of its time blocking and unblocking if the inter-synchronization
interval is small enough. The problem would be even more severe if there were a task force
of cooperating processes, e.g. the rootfinding processes, blocking and unblocking every few
milliseconds.
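
The disk arithmetic above can be restated as a short Python sketch. The revolution time and the two-revolution update come from the text; the function names and the sample figures (one dirty page, a 10 millisecond compute interval) are hypothetical, and queueing and rotational latency are ignored.

```python
REVOLUTION_MS = 16.67        # from the text: one disk revolution
REVOLUTIONS_PER_PAGE = 2     # one to write, one to read-check


def page_update_ms(dirty_pages):
    """Minimum time to update `dirty_pages` pages on the swapping disk."""
    return dirty_pages * REVOLUTIONS_PER_PAGE * REVOLUTION_MS


def swap_fraction(compute_ms, dirty_pages):
    """Fraction of each compute-then-block cycle spent writing dirty pages."""
    swap = page_update_ms(dirty_pages)
    return swap / (compute_ms + swap)


one_page = page_update_ms(1)
fraction = swap_fraction(compute_ms=10.0, dirty_pages=1)
```

With these sample figures a process that computes for 10 milliseconds between synchronizations would spend roughly three quarters of each cycle updating its single dirty page, which is the situation the delayed update is meant to avoid.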


The current version of the PM initializes the delay time parameter, t, to 300 milliseconds.
Table 4.1 is a summary of the time it takes to perform the basic semaphore operations on the
various primitives.


Measurement                   Spin Lock    K-SEM      PM0    PM1(t=0)   PM1(t=300)

Time for a process
to do a V (us.)                      30     3000     6000        5000         5000

Time till a process
wakes up from a V (us.)              30     5000    55000       50000        13000

Time from P to CSW (us.)             na     3000     9000  [illegible]        6000

Time spent in PM while
waking a process (us.)               na       na    62000       20000            0

Table 4.1 Comparison of Execution Times for
Semaphore Primitive Operations


4.3.  The  Impact  of  Synchronization  on  Performance 

4.3.1.  Introduction 

Now that we have described the functionality and presented the individual performance
statistics for the basic primitive operations, we can observe the impact of synchronization on
the performance of the rootfinder. We have eliminated most of the overheads associated
with synchronization by using the spin lock primitive. The remainder of the paper examines
the rootfinder's performance as we employ the alternative synchronization primitives.

4.3.2. Comparison of Primitives When Compute Time ~ Synchronization Time

The first graph, Figure 4.2, compares the performance of the various implementations of
the rootfinder using different primitives to perform the process synchronization. We have
plotted the elapsed time to find 50 roots as a function of the number of processes. This data
was generated by the authentic, not synthetic, rootfinder. The distribution of the F(x)
computation is approximately Normal with mean 72 milliseconds and standard deviation 18
milliseconds*. We compare the performance of four alternative synchronization primitives:
spin lock, K-SEM, PM1(t=300), and PM0 semaphores.

Figure 4.2 A Performance Comparison of Synchronization Primitives

The curve for the PM0 semaphore implementation exhibits degradation as we increase
parallelism. The reason for this behavior is that the overhead of synchronization is greater
than the average compute time. A process spends more time synchronizing than computing. In
this instance we would be better off using a single process.

The curve for the PM1(t=300) semaphore implementation depicts substantially better
performance than its predecessor. Performance reaches a maximum speedup of 2.00 at six
processes. No additional speedup is gained by employing more processes. Moreover, a
noticeable degradation occurs at nine processes. This sudden degradation occurs because of
the non-homogeneous processor configuration (NHPC). During this experiment C.mmp's
processor configuration was eight 11/40's and one 11/20. Thus when we incorporated the
ninth process, it ran on the slower 11/20 type processor. The STAGE time lengthened, thus
yielding an overall slower performance.

The K-SEM implementation has its peak performance of 2.4 at eight processes. It too is
affected by the NHPC problem and performance degrades slightly at nine processes. The
overall performance of the K-SEM implementation is about midway between the PM1(t=300)
and the spin lock versions.

The spin lock implementation has by far the best speedup, a maximum of about 2.8 for eight
processes. The NHPC problem causes a much more severe performance degradation for this
semaphore than for the others**. The reason is that the processes blocked on the spin lock
semaphore remain on their processors, whereas the other implementations free the faster
11/40 type processors to steal the process that is still running on the slower 11/20
processor.

*On an 11/40 processor.

**The PM0 implementation performance curve has a greater degradation than the spin lock
version; however, the reason is not merely the NHPC problem. The primary reason is that
the two PM processes that perform the semaphore operations are almost constantly running.

4.3.3. Comparison When Compute Time is Much Greater Than Synchronization
Time

In the previous experiment the overhead of synchronization was in some cases a
considerable fraction of the STAGE time. If we make the compute time for the function
evaluation much larger, thus reducing the percentage of time spent synchronizing, the
performance difference between the various implementations is also reduced. Figure 4.3
graphs performance in terms of speedup as a function of the number of processes. We used
the synthetic rootfinder again to generate F(x) computations that take 375 milliseconds to
compute, with the distribution a constant. The dashed curve is the performance obtained using
the PM0 semaphore and the solid curve the performance obtained using the spin lock.

We expected the curves to be closer together, yet the spin lock version outperforms the
PM0 semaphore 2.8 to 2.1 at maximum speedup. The reason for the large difference is that
the PM processes must perform the semaphore operations serially, each V operation taking
about fifty-five milliseconds. Thus the nth rootfinder process is not started until 55*n
milliseconds into the STAGE time. In this manner the ninth rootfinder process does not
complete its function evaluation until 870 milliseconds have passed. Similarly, when the
rootfinder processes complete their F(x) calculations, the PM processes again serially perform
the P operations on the semaphores, causing still further performance degradations.
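
The 870 millisecond figure follows directly from the serial wake-up. A minimal Python sketch of the arithmetic, using the 55 millisecond V time and the constant 375 millisecond compute time from the text (the function names are hypothetical, and the serial P operations at the end of the stage are not modelled):

```python
WAKE_MS = 55        # approximate time for the PM to perform one V
COMPUTE_MS = 375    # constant F(x) compute time in this experiment


def finish_time_ms(n, wake_ms=WAKE_MS, compute_ms=COMPUTE_MS):
    """Time at which the nth process (1-based) finishes its function evaluation."""
    return wake_ms * n + compute_ms


def stage_compute_ms(num_processes):
    """Time until the last-started process has finished its evaluation."""
    return max(finish_time_ms(n) for n in range(1, num_processes + 1))


ninth = finish_time_ms(9)
stage = stage_compute_ms(9)
```

Here `ninth` and `stage` are both 870 milliseconds, against 375 milliseconds if all nine processes could be started at once.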

The severe performance degradation that occurs at eight and at nine processes for the
spin-lock implementation is another instance of the NHPC problem. This time, with only seven
11/40 type processors, performance peaks at seven processes, declines slightly at eight, and
then plummets from a speedup of more than 2.7 to slightly more than 2.0. The performance
of the two implementations is nearly identical at nine processes.

However, in Figure 4.4, where the distribution is exponential, relatively little difference
exists between the performances of the two implementations. Because the distribution of the
compute phase causes the processes to arrive at random times, the PM does not become a
bottleneck when the processes finish their work. When they are restarted, the last one to be
started is still delayed by 55*n milliseconds. However, since the distribution is exponential,
the process that must compute the function evaluation with a compute time that lies in the
long tail of the distribution always finishes last. Thus the overhead of synchronization is
again hidden by the MAX function that governs the STAGE time.
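
The way the MAX function hides the staggered start can be explored with a small Monte Carlo sketch in Python. It is an illustration under simplifying assumptions, not the synthetic rootfinder: only the serial 55 millisecond wake-ups are modelled, the mean compute time is 375 milliseconds in both cases, and the function names, trial count and seed are hypothetical.

```python
import random


def stage_times(compute_ms, wake_ms=55):
    """Return (stage time with serial wake-ups, stage time with instant wake-ups)."""
    staggered = max(wake_ms * (i + 1) + c for i, c in enumerate(compute_ms))
    ideal = max(compute_ms)
    return staggered, ideal


def mean_overhead(draw, num_processes=9, trials=2000, seed=1):
    """Average extra STAGE time caused by the serial wake-ups."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        staggered, ideal = stage_times([draw(rng) for _ in range(num_processes)])
        total += staggered - ideal
    return total / trials


constant_overhead = mean_overhead(lambda rng: 375.0)
exponential_overhead = mean_overhead(lambda rng: rng.expovariate(1 / 375.0))
```

With a constant compute time the full 9 * 55 = 495 milliseconds of wake-up delay is exposed in every stage; with exponentially distributed compute times the average exposed delay is smaller, because the stage is often ended by a process with a long compute time that was not the last to be started.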


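[Figure 4.3: speedup versus number of processes, constant compute-time distribution]

[Figure 4.4: speedup versus number of processes, exponential compute-time distribution]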

5. Summary of Results: The Useful Range for Various Semaphores

In Figure 4.5 we have summarized the results of this investigation by graphing the useful
range for each of the synchronization primitives. We have graphed the performance of the
rootfinder using each primitive as we vary the size of the computation phase between
synchronization points. For each point, five cooperating processes performed 1000 total
function evaluations to find 50 roots. The distribution of the function evaluation was a
constant and ranged in size from 2 milliseconds to 375 milliseconds.


The NO-OVERHEAD curve is the ideal performance we would see if no degradation occurred
due to hardware, operating system or synchronization overheads.

The 50% line represents our threshold for adequate performance. It parallels the
NO-OVERHEAD curve but represents exactly half of the performance that would be achieved
in the best case. The point at which a performance curve crosses the 50% line is the
threshold of usability for that synchronization primitive.


From these results we see that the spin lock is the only primitive that performs adequately
when the length of the compute phase is less than 15 ms. At the other extreme, all of the
primitives, with the exception of the initial version of the policy-module semaphore, become
indistinguishable beyond 400 ms. In the region between these two endpoints the user can
select the appropriate primitive to match the length of the computation phase. The cross-over
points for the various semaphores appear in the table below.


Semaphore Type       Cross-over Point (msecs.)

Spin Lock                        2
K-SEM                           18
PM1(t=300)                      33
PM1(t=0)                        80
PM0                            200


Table 4.2 Cross-over Points for the Various Semaphores
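
The cross-over points can be used as a simple selection rule. The Python sketch below is only an illustration of that rule: the thresholds are the ones in Table 4.2, while the labels used for the two PM1 variants, the dictionary and the function name are our own assumptions.

```python
# Cross-over points (milliseconds) from Table 4.2, fastest primitive first.
CROSS_OVER_MS = {
    "Spin Lock": 2,
    "K-SEM": 18,
    "PM1(t=300)": 33,
    "PM1(t=0)": 80,
    "PM0": 200,
}


def usable_primitives(inter_sync_ms):
    """Primitives at or above the 50% threshold for this inter-synchronization time."""
    return [name for name, threshold in CROSS_OVER_MS.items()
            if inter_sync_ms >= threshold]


short_phase = usable_primitives(20)    # compute phase of about 20 ms
long_phase = usable_primitives(250)    # compute phase of about 250 ms
```

For a 20 millisecond compute phase only the spin lock and the Kernel semaphore qualify, whereas at 250 milliseconds every primitive in the table is above the threshold.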


[Figure 4.5: performance versus observed inter-synchronization time (milliseconds)]